Abstract. - The common-neighbor-based method is a simple yet effective way to predict missing links; it assumes that two nodes are more likely to be connected if they have more common neighbors. In such a method, each common neighbor of two nodes contributes equally to the connection likelihood. In this Letter, we argue that different common neighbors may play different roles and thus make different contributions, and we propose a local naive Bayes model accordingly. Extensive experiments were carried out on eight real networks. Compared with the common-neighbor-based methods, the present method provides more accurate predictions. Finally, we give a detailed case study on the US air transportation network.
Introduction. - Link prediction aims at estimating the likelihood that a link exists between two nodes in a given network, based on the observed links [1]. Recently, the study of link prediction has attracted much attention from disparate scientific communities. On the theoretical side, accurate prediction gives evidence for the underlying mechanisms that drive network evolution [2]. Moreover, link prediction offers a framework in which a fair evaluation platform for network models can be built, which may be of interest to network scientists [1,3]. On the practical side, for biological networks such as protein-protein interaction networks and metabolic networks [4-6], experiments that uncover new links or interactions are costly, so predicting in advance and focusing on the links most likely to exist can sharply reduce the experimental costs [7]. In addition, some representative link prediction methods have been successfully applied to the classification problem in partially labeled networks [8,9], as well as to the identification of spurious links resulting from inaccurate information in the data [10].

Motivated by these theoretical interests and this practical significance, many methods for link prediction have been proposed. Some algorithms are based on Markov chains [11-13] and machine learning [14, 15]. Another group of algorithms is based on node similarity [1, 16].



Common Neighbors (CN) is one of the simplest similarity indices. Empirically, Kossinets and Watts [17] analyzed a large-scale social network and suggested that two students with many mutual friends are very likely to become friends in the future. Likewise, in online societies such as Facebook, users tend to become friends if they have common friends, and thereby form a community. Extensive analysis of disparate networks suggests that the CN index performs well at predicting missing links [16, 18]. Lü et al. [19] suggested that in a network with a large clustering coefficient, CN can provide predictions competitive in accuracy with indices that make use of global information. Very recently, Cui et al. [20] revealed that nodes with more common neighbors are more likely to form new links in a growing network. The basic assumption of CN is that two nodes are more likely to be connected if they have more common neighbors. Simply counting the number of common neighbors implies that each common neighbor contributes equally to the connection likelihood. However, different common neighbors may play different roles. For instance, the common close friends of two people who do not know each other may contribute more to their possible future friendship than their common nodding acquaintances. In this Letter, we propose a probabilistic model based on Bayesian theory, called the Local Naive Bayes (LNB) model, to predict missing links in complex networks. Under the LNB model, two node pairs with exactly the same number of common neighbors may have very different connection likelihoods. Experiments on eight real networks demonstrate that our method can effectively identify the different roles of common neighbors in the connection likelihood and thus gives more accurate predictions than CN.

Problem Description. - Consider an undirected network $G(V, E)$, where $V$ and $E$ are the sets of nodes and links, respectively. Multiple links and self-connections are not allowed. Each nonexistent link, namely a link $(x,y) \in U \setminus E$, where $x, y \in V$ and $U$ denotes the universal set of all possible links, is assigned a score that quantifies its existence likelihood. A higher score means a higher probability that nodes $x$ and $y$ are connected, and vice versa. All nonexistent links are sorted in descending order of score, and the links at the top are the most likely to exist. To test an algorithm's accuracy, the observed links $E$ are randomly divided into two parts: the training set $E^T$, which is treated as known information, and the probe set $E^P$, which is used for testing and from which no information may be used for prediction. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. In this Letter, the training set always contains 90% of the links, and the remaining 10% constitute the probe set. Hereinafter, the links in $E^P$ are called missing links and the links in $U \setminus E^T$ are called non-observed links.
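The division described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function names `split_links` and `non_observed_links`, the `seed` parameter, and the convention of storing each link as a sorted tuple are assumptions made for the example.

```python
import random
from itertools import combinations


def split_links(links, probe_fraction=0.1, seed=None):
    """Randomly divide the observed links E into a training set E^T and a probe set E^P."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    probe = set(shuffled[:n_probe])   # E^P: hidden from the predictor
    train = set(shuffled[n_probe:])   # E^T: the only known information
    return train, probe


def non_observed_links(nodes, train):
    """Return U \\ E^T: every node pair that is not a link of the training set.

    Links are assumed to be stored as tuples (x, y) with x < y.
    """
    universe = set(combinations(sorted(nodes), 2))
    return universe - train
```

The non-observed links returned by the second function contain both the missing links (the probe set) and the truly nonexistent links, which is exactly the candidate list that a prediction index has to rank.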

We apply two standard metrics to quantify the prediction accuracy: AUC (area under the receiver operating characteristic curve) [21] and precision [22]. The AUC evaluates the algorithm's performance over the whole ranked list. Given the ranking of all non-observed links, the AUC can be interpreted as the probability that a randomly chosen missing link is given a higher score than a randomly chosen nonexistent link. In the implementation, if among $n$ independent comparisons the missing link has the higher score $n'$ times and the two links have the same score $n''$ times, the AUC value is:



$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. \tag{1}$$



If all the scores are generated from an independent and identical distribution, the AUC should be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much better the algorithm performs than pure chance. Unlike AUC, precision focuses only on the $L$ links with the top ranks, i.e., the highest scores. It is defined as the ratio of relevant items selected to the total number of items selected. If $L_r$ of the top-$L$ links are accurately predicted (i.e., $L_r$ of them are in the probe set), then the precision equals $L_r/L$. Clearly, higher precision means higher prediction accuracy.
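Both metrics are easy to state in code. The sketch below follows the sampling form of Eq. (1) and the definition of precision given above; the function names, the dictionary-based score lookup, the default score of 0 for unscored pairs, and the default of 10000 comparisons are illustrative assumptions rather than details taken from the Letter.

```python
import random


def auc_by_sampling(scores, probe, nonexistent, n=10000, seed=None):
    """Estimate AUC from n independent comparisons, as in Eq. (1).

    scores maps a node pair to its score; pairs without an entry score 0.
    """
    rng = random.Random(seed)
    probe = list(probe)
    nonexistent = list(nonexistent)
    n_higher = 0  # n': the missing link has the higher score
    n_equal = 0   # n'': both links have the same score
    for _ in range(n):
        s_missing = scores.get(rng.choice(probe), 0.0)
        s_nonexistent = scores.get(rng.choice(nonexistent), 0.0)
        if s_missing > s_nonexistent:
            n_higher += 1
        elif s_missing == s_nonexistent:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n


def precision_at(scores, probe, candidates, L=100):
    """Return L_r / L, where L_r counts probe links among the top-L candidates.

    Ties keep the input order of candidates, because Python's sort is stable.
    """
    ranked = sorted(candidates, key=lambda pair: scores.get(pair, 0.0), reverse=True)
    hits = sum(1 for pair in ranked[:L] if pair in probe)
    return hits / L
```

Here `candidates` stands for the list of non-observed links, that is, the union of the probe set and the nonexistent links.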

Method. - Given a pair of disconnected nodes $(x,y)$ (i.e., a non-observed link), our task is to calculate the probability that these two nodes are connected, given that $x$ and $y$ may have a group of common neighbors, each of which has a pair of conditional probabilities corresponding to encouraging and hampering the connection of its two neighbors (see Eqs. (6) and (7)).

The Naive Bayes Classifier. A naive Bayes classifier is a simple probabilistic classifier based on Bayesian theory with the strong (naive) independence assumption that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature [23]. Abstractly, the probability model for a classifier is a conditional model $P(C|F_1, \cdots, F_n)$, where $C$ is a dependent class variable and $F_1, F_2, \cdots, F_n$ are feature variables. According to Bayesian theory (see footnote 1), the posterior probability $P(C|F_1, F_2, \cdots, F_n)$ is



$$P(C|F_1, F_2, \cdots, F_n) = \frac{P(C) \cdot P(F_1, F_2, \cdots, F_n|C)}{P(F_1, F_2, \cdots, F_n)}. \tag{2}$$



Under the naive assumption that each feature $F_i$ is conditionally independent of every other feature $F_j$ ($j \neq i$), we have

$$P(C|F_1, F_2, \cdots, F_n) = \frac{P(C) \prod_{i=1}^{n} P(F_i|C)}{P(F_1, F_2, \cdots, F_n)}. \tag{3}$$
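As a small numerical illustration of Eq. (3), the sketch below computes the posterior of each class from its prior and the per-feature likelihoods. The function name and the input layout are assumptions made for the example. The denominator is obtained by summing the numerators over all classes, which is the law of total probability applied under the same naive independence assumption.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods):
    """Posterior P(C | F_1..F_n) for every class C under the naive assumption.

    priors:      {class: P(C)}
    likelihoods: {class: [P(F_1|C), ..., P(F_n|C)]} for the observed features
    """
    # Numerator of Eq. (3) for each class: P(C) * prod_i P(F_i|C).
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    # Denominator P(F_1..F_n), obtained by summing the numerators over classes.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


posterior = naive_bayes_posterior(
    {"A1": 0.5, "A0": 0.5},
    {"A1": [0.8, 0.5], "A0": [0.2, 0.5]},
)
```

With these made-up numbers the two numerators are 0.2 and 0.05, so the posterior of class `A1` is 0.2 / 0.25 = 0.8.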



Local Naive Bayes Model. Given a training set $G(V, E^T)$, link prediction asks which links are more likely to exist among all the non-observed links. Denote by $A_1$ and $A_0$ the class variables of connection and disconnection, respectively. The prior probabilities of $A_1$ and $A_0$ can be calculated through



$$P(A_1) = \frac{M^T}{M}, \tag{4}$$

$$P(A_0) = \frac{M - M^T}{M}, \tag{5}$$

where $M^T = |E^T|$ and $M = |U| = \frac{1}{2}|V| \cdot (|V| - 1)$. Each node $w$ owns two conditional probabilities $\{P(w|A_1), P(w|A_0)\}$, where $P(w|A_1)$ is the probability that node $w$ is the common neighbor of two connected nodes, and $P(w|A_0)$ is the probability that node $w$ is the common neighbor of two disconnected nodes. According to Bayesian theory, these two probabilities are



$$P(w|A_1) = \frac{P(w) \cdot P(A_1|w)}{P(A_1)}, \tag{6}$$

$$P(w|A_0) = \frac{P(w) \cdot P(A_0|w)}{P(A_0)}. \tag{7}$$

Footnote 1: Bayesian theory [24,25] is a probabilistic approach that relates the conditional and marginal probabilities of events $A$ and $B$, provided that the probability of $B$ is not zero: $P(A|B) = \frac{P(B|A) P(A)}{P(B)}$, where $P(A)$ is the prior probability (or marginal probability) of $A$, which does not take into account any information about $B$, and $P(A|B)$ is the conditional probability of $A$ given $B$. $P(A|B)$ is also called the posterior probability.

Table 1: The basic topological features of the eight networks. $|V|$ and $|E|$ are the numbers of nodes and links. $C$ and $r$ are the clustering coefficient [29] and the assortative coefficient [36], respectively. ⟨k⟩ is the average degree of the network, and ⟨d⟩ is the average shortest distance between node pairs. $H$ denotes the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$.

Networks  USAir   Yeast   CE      PB      NS      FW1     FW2     FW3
|V|       332     2375    297     1222    379     128     69      97
|E|       2126    11693   2148    16717   941     2106    880     1446
C         0.749   0.388   0.308   0.361   0.798   0.335   0.552   0.468
r         -0.208  0.454   -0.163  -0.221  -0.082  -0.104  -0.298  -0.151
⟨k⟩       12.81   9.85    14.46   27.36   4.82    32.90   25.51   29.81
⟨d⟩       2.46    4.59    2.46    2.51    4.93    1.77    1.64    1.69
H         3.46    3.48    1.80    2.97    1.663   1.231   2.33    1.48

For a pair of disconnected nodes $(x,y)$, denote by $O_{xy}$ the set of their common neighbors, which are considered as the feature variables and assumed to be independent of each other. Then, according to the naive Bayes classifier theory, the posterior probabilities of connection and disconnection of nodes $x$ and $y$ are



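$$P(A_1|O_{xy}) = \frac{P(A_1)}{P(O_{xy})} \prod_{w \in O_{xy}} P(w|A_1), \tag{8}$$

$$P(A_0|O_{xy}) = \frac{P(A_0)}{P(O_{xy})} \prod_{w \in O_{xy}} P(w|A_0). \tag{9}$$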



For a given node pair, comparing these two probabilities tells us whether the two nodes are likely to connect. However, it cannot tell us which non-observed links are more likely to exist than others. In order to compare the existence likelihood across node pairs, we define the likelihood score of node pair $(x,y)$ as the ratio of Eq. (8) to Eq. (9). Substituting Eqs. (6) and (7), we have

$$r_{xy} = \frac{P(A_1)}{P(A_0)} \prod_{w \in O_{xy}} \frac{P(A_0) \cdot P(A_1|w)}{P(A_1) \cdot P(A_0|w)}. \tag{10}$$



Indeed, $P(A_1|w)$ is equal to the clustering coefficient of node $w$, denoted by $C_w$, which can be calculated by

$$P(A_1|w) = C_w = \frac{N_{\triangle w}}{N_{\triangle w} + N_{\wedge w}}, \tag{11}$$



where $N_{\triangle w}$ and $N_{\wedge w}$ are respectively the numbers of connected and disconnected node pairs whose common neighbors include $w$. Obviously $N_{\triangle w} + N_{\wedge w} = k_w (k_w - 1)/2$, where $k_w$ is the degree of node $w$. Since $P(A_1|w) + P(A_0|w) = 1$, we have

$$P(A_0|w) = 1 - C_w = \frac{N_{\wedge w}}{N_{\triangle w} + N_{\wedge w}}. \tag{12}$$



Substituting Eqs. (4), (5), (11) and (12) into Eq. (10), the likelihood score of node pair $(x,y)$ is

$$r_{xy} = s^{-1} \prod_{w \in O_{xy}} s \, \frac{N_{\triangle w} + 1}{N_{\wedge w} + 1}, \tag{13}$$

where $s = \frac{P(A_0)}{P(A_1)} = \frac{M}{M^T} - 1$ is a constant for a given training set, so the prefactor $s^{-1}$ can be neglected in calculation. Note that we apply add-one smoothing here to prevent the score from being equal to 0. Clearly, a larger score means a higher probability that the two nodes are connected. For a given node $w$, we directly define its role function as

$$R_w = \frac{N_{\triangle w} + 1}{N_{\wedge w} + 1}. \tag{14}$$

Therefore Eq. (13) can be written as

$$r_{xy} \sim \prod_{w \in O_{xy}} s R_w. \tag{15}$$



Clearly, if $R_w = 1$ for all nodes in the network, then the score $r_{xy}$ of nodes $x$ and $y$ becomes a monotonically increasing function of the number of their common neighbors (provided $s > 1$, which holds for sparse networks). In this case, Eq. (15) is equivalent to CN ($= |O_{xy}|$). In general, different common neighbors make different contributions to the connecting probability, according to their $R_w$.
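The role function of Eq. (14) depends only on the neighborhood of a node, so it can be computed in one pass over the training network. The following sketch assumes the network is stored as a dictionary that maps each node to the set of its neighbors; this representation and the function name `role_function` are choices made for the illustration, not part of the Letter.

```python
from itertools import combinations


def role_function(adj):
    """Compute R_w = (N_tri + 1) / (N_wedge + 1) for every node w, as in Eq. (14).

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    roles = {}
    for w, nbrs in adj.items():
        k = len(nbrs)
        # Connected pairs of neighbors of w: pairs that have w as a common
        # neighbor and are themselves linked.
        n_tri = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        # Disconnected pairs: the remaining k(k-1)/2 - n_tri pairs.
        n_wedge = k * (k - 1) // 2 - n_tri
        roles[w] = (n_tri + 1) / (n_wedge + 1)
    return roles


# A triangle 0-1-2 with a pendant node 3 attached to node 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
roles = role_function(example)
```

In this toy network node 0 has one connected and two disconnected neighbor pairs, giving a role of 2/3, whereas nodes 1 and 2 each close a triangle and get a role of 2. A value above 1 marks a common neighbor that favors a connection, and a value below 1 marks one that hampers it.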

It has been pointed out that the degrees of common neighbors play an important role in link prediction: suppressing the contributions of common neighbors with high degrees can improve the prediction accuracy [18]. Two indices are designed in this way, the Adamic-Adar index (AA) [26] and Resource Allocation (RA) [18,27]. Their scores are defined respectively as

$$s^{\mathrm{AA}}_{xy} = \sum_{w \in O_{xy}} \frac{1}{\log k_w} \tag{16}$$

and

$$s^{\mathrm{RA}}_{xy} = \sum_{w \in O_{xy}} \frac{1}{k_w}. \tag{17}$$
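For comparison with the model-based scores, the three baseline indices can be written down directly from their definitions. The sketch below is an illustration under assumed conventions: the adjacency dictionary of neighbor sets and the function name `cn_aa_ra` are not from the Letter, and the natural logarithm is used for AA.

```python
from math import log


def cn_aa_ra(adj, x, y):
    """Return the CN, AA (Eq. 16) and RA (Eq. 17) scores of the node pair (x, y).

    adj maps each node to the set of its neighbors.
    """
    common = adj[x] & adj[y]  # O_xy, the set of common neighbors
    cn = len(common)
    # A common neighbor is linked to both x and y, so its degree is at least 2
    # and log(k_w) is strictly positive.
    aa = sum(1.0 / log(len(adj[w])) for w in common)
    ra = sum(1.0 / len(adj[w]) for w in common)
    return cn, aa, ra


# A 4-cycle 0-1-2-3-0: nodes 0 and 2 share the neighbors 1 and 3 (both of degree 2).
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
cn, aa, ra = cn_aa_ra(cycle, 0, 2)
```

Both AA and RA give less weight to high-degree common neighbors; RA penalizes them more strongly because $1/k_w$ decays faster than $1/\log k_w$.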



With the same motivation, we add an exponent $f(k_w)$ to the term $sR_w$ in Eq. (15), where $f$ is a function of the node's degree. Taking the logarithm of both sides, we obtain a linear formula for the connection likelihood:

$$r'_{xy} = \sum_{w \in O_{xy}} f(k_w) \log(s R_w). \tag{18}$$



Table 2: The prediction accuracy measured by AUC and precision (top-100) on the eight networks. Each value is obtained by averaging over 100 implementations with independently random divisions of training set and probe set.

Metric     Index   USAir  Yeast  CE     PB     NS     FW1    FW2    FW3
AUC        CN      0.953  0.916  0.848  0.924  0.980  0.606  0.689  0.710
AUC        LNB-CN  0.959  0.916  0.862  0.926  0.982  0.694  0.733  0.748
AUC        AA      0.965  0.916  0.865  0.927  0.984  0.608  0.697  0.713
AUC        LNB-AA  0.967  0.916  0.866  0.928  0.984  0.697  0.733  0.750
AUC        RA      0.972  0.916  0.870  0.928  0.984  0.613  0.704  0.716
AUC        LNB-RA  0.972  0.917  0.867  0.929  0.984  0.697  0.730  0.750
Precision  CN      0.597  0.685  0.131  0.419  0.356  0.087  0.146  0.133
Precision  LNB-CN  0.612  0.689  0.138  0.409  0.391  0.106  0.192  0.161
Precision  AA      0.615  0.699  0.135  0.378  0.527  0.090  0.156  0.139
Precision  LNB-AA  0.629  0.703  0.136  0.380  0.528  0.104  0.193  0.161
Precision  RA      0.630  0.506  0.126  0.247  0.547  0.086  0.169  0.145
Precision  LNB-RA  0.633  0.625  0.129  0.259  0.548  0.104  0.196  0.170

Here we consider three forms of the function $f$, namely $f(k_w) = 1$, $f(k_w) = \frac{1}{\log k_w}$ and $f(k_w) = \frac{1}{k_w}$, which correspond to the Local Naive Bayes (LNB) forms of the CN, AA and RA indices, respectively:

$$r^{\mathrm{LNB\text{-}CN}}_{xy} = |O_{xy}| \log s + \sum_{w \in O_{xy}} \log R_w, \tag{19}$$

$$r^{\mathrm{LNB\text{-}AA}}_{xy} = \sum_{w \in O_{xy}} \frac{1}{\log k_w} (\log s + \log R_w), \tag{20}$$

$$r^{\mathrm{LNB\text{-}RA}}_{xy} = \sum_{w \in O_{xy}} \frac{1}{k_w} (\log s + \log R_w). \tag{21}$$

Obviously, when $R_w = 1$, namely when we do not consider the different roles of common neighbors, LNB-CN, LNB-AA and LNB-RA degenerate to CN, AA and RA, respectively (up to the constant factor $\log s$).
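Eqs. (19)-(21) differ only in the degree-dependent weight applied to each common neighbor, so all three scores can be accumulated in a single loop. The sketch below is a minimal illustration under stated assumptions: the training network is an adjacency dictionary of neighbor sets, the function name `lnb_scores` is hypothetical, natural logarithms are used, and the network is assumed to have at least one link and not to be complete, so that $s > 0$.

```python
from itertools import combinations
from math import log


def lnb_scores(adj, x, y):
    """Return the LNB-CN, LNB-AA and LNB-RA scores of the pair (x, y), Eqs. (19)-(21)."""
    n_nodes = len(adj)
    m_total = n_nodes * (n_nodes - 1) // 2                  # M = |U|
    m_train = sum(len(nbrs) for nbrs in adj.values()) // 2  # M^T = |E^T|
    log_s = log(m_total / m_train - 1)                      # s = M / M^T - 1

    lnb_cn = lnb_aa = lnb_ra = 0.0
    for w in adj[x] & adj[y]:                               # common neighbors O_xy
        k = len(adj[w])
        n_tri = sum(1 for u, v in combinations(adj[w], 2) if v in adj[u])
        n_wedge = k * (k - 1) // 2 - n_tri
        # log(s * R_w), with R_w the add-one smoothed role function of Eq. (14).
        term = log_s + log((n_tri + 1) / (n_wedge + 1))
        lnb_cn += term           # f(k_w) = 1
        lnb_aa += term / log(k)  # f(k_w) = 1 / log k_w
        lnb_ra += term / k       # f(k_w) = 1 / k_w
    return {"LNB-CN": lnb_cn, "LNB-AA": lnb_aa, "LNB-RA": lnb_ra}


# A 4-cycle 0-1-2-3-0 plus a pendant node 4 attached to node 0.
toy = {0: {1, 3, 4}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}, 4: {0}}
toy_scores = lnb_scores(toy, 0, 2)
```

In the toy network $M = 10$ and $M^T = 5$, so $s = 1$ and $\log s = 0$. Both common neighbors of nodes 0 and 2 have $R_w = 1/2$, which makes all three scores negative: the model treats these common neighbors as evidence against a link, something a plain count of common neighbors cannot express.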

Results. - Eight networks are considered in our experiments:
(i) USAir [28]: The network of the US air transportation system, which contains 332 airports and 2126 airlines.
(ii) C. elegans (CE) [29]: The neural network of the nematode worm C. elegans, in which an edge joins two neurons if they are connected by either a synapse or a gap junction. This network contains 297 neurons and 2148 links.
(iii) Political Blogs (PB) [30]: A network of US political blogs. The original links are directed; here we treat them as undirected.
(iv) Yeast [31]: A protein-protein interaction network containing 2617 proteins and 11855 interactions. Although this network is not well connected (it contains 92 components), most of the nodes belong to the giant component, whose size is 2375.
(v) NetScience (NS) [32]: A network of coauthorships between scientists who themselves publish on the topic of networks. The network contains 1589 scientists, 128 of whom are isolated. Here we do not consider those isolated nodes. NS is not well connected: it consists of 268 connected components, and the largest connected component has only 379 nodes.
(vi) Foodweb1 (FW1) [33]: A food web in Florida Bay during the wet season.
(vii) Foodweb2 (FW2) [34]: A food web in Everglades Graminoids during the wet season.
(viii) Foodweb3 (FW3) [35]: A food web in Mangrove Estuary during the wet season.
Here we only consider the giant component. The basic topological features of these eight networks are summarized in Table 1.

The prediction accuracy, measured by AUC and precision, on the eight real networks is shown in Table 2. In general, the LNB forms outperform their corresponding basic forms. For AUC, the LNB forms are at least as accurate as the basic forms on all eight networks, with the single exception of RA on the CE network. The improvements on the food webs are especially significant. Measured by precision, the LNB model also improves the accuracy, except for CN on the PB network. In accordance with our analysis, the CN method assigns equal weight to all common neighbors (i.e., $R_w = 1$ for all nodes), while LNB-CN can effectively capture the different roles of common neighbors.

Case Study. - In this section, we give a detailed analysis of the USAir network, which contains 332 airports and 2126 airlines. This network has a very specific structure: a hierarchical organization consisting of hubs, local centers and small local airports. We rank all the airports by degree in descending order. The top 17 airports, which together own about 1/3 of the total degree, are defined as hubs (Hub); the last 273 airports, which also own about 1/3 of the total degree, are local airports (LA); and the remaining 42 airports are local centers (LC). There are therefore six kinds of links: Hub-Hub (97.06%), Hub-LC (72.83%), LC-LC (36.24%), LA-Hub (12.5%), LA-LC (2.75%) and LA-LA (0.72%). The number in brackets indicates the connecting probability: the ratio of the number of real links to its maximum possible value. For example, there are 132 links connecting two hubs among the $17 \times 16 / 2 = 136$ possible links, and thus the connecting probability for two hubs is 97.06%, indicating that hubs are densely connected with each other, which is a specific feature of air transportation networks [37]. For this reason, both the CN and LNB-CN methods can provide accurate predictions of the missing Hub-Hub links. Moreover, we find that although the number of common neighbors of (LC,LC) pairs is higher than that of (Hub,LC) pairs, LCs are more likely to connect with Hubs (72.83% > 36.24%). Therefore, by simply counting the number of common neighbors, CN tends to assign higher scores to (LC,LC) than to (Hub,LC) and thus leads to poor predictions. Since LNB-CN is able to identify the negative roles of the common neighbors of (LC,LC) pairs, it lowers the scores of (LC,LC) pairs and provides more accurate predictions of Hub-LC and LC-LC links. Compared with the first three kinds of links, the remaining three kinds, which involve local airports, are difficult to predict (as shown later, none of these links is included in the top-100 of either CN or LNB-CN). Because LAs are usually connected with Hubs and rarely have common neighbors with other airports, the node pairs involving LAs are given very small scores and are thus ranked lower. This is a common drawback of CN-based methods.
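The connecting probability used above is a simple ratio, and the Hub-Hub figure can be checked in a couple of lines. The helper below is a hypothetical illustration (its name and signature are not from the Letter); only the Hub-Hub numbers, 132 links among 17 hubs, are taken from the text.

```python
def connecting_probability(n_links, n_a, n_b=None):
    """Ratio of the number of real links to the maximum possible number of links.

    With only n_a given, the links are within one class of n_a nodes;
    with n_b given as well, they run between two different classes.
    """
    if n_b is None:
        max_links = n_a * (n_a - 1) // 2  # unordered pairs within one class
    else:
        max_links = n_a * n_b             # pairs across two classes
    return n_links / max_links


# Hub-Hub: 132 real links among the 17 * 16 / 2 = 136 possible ones.
print(f"{connecting_probability(132, 17):.2%}")  # prints 97.06%
```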

We further focus on the top-100 node pairs ranked by CN and by LNB-CN, respectively. Fig. 1(a) shows the top-100 node pairs ranked by CN and their corresponding ranks assigned by LNB-CN, and Fig. 1(b) shows the top-100 node pairs ranked by LNB-CN and their corresponding ranks assigned by CN. The links that are accurately predicted (i.e., $\in E^P$) are labeled by blue dots, while the non-existent links (i.e., $\in U \setminus E$) are labeled by red crosses. From Fig. 1(a), we can see that among the top-100 node pairs ranked by CN, 11 node pairs are ranked outside the top 100 by LNB-CN, of which only 2 are predicted correctly; this implies that LNB-CN tends to rank non-existent links lower than CN does. Among the remaining 9 node pairs, 6 are LC-LC links and 3 are of type Hub-LC. This result further demonstrates that LNB-CN gives more accurate judgements on the LC-LC and Hub-LC links. Similarly, in Fig. 1(b), 6 pairs are ranked outside the top 100 by CN, of which 5 are predicted correctly, which indicates that LNB-CN tends to rank missing links higher than CN does.

We take the four pairs within the rectangle in Fig. 1(a) as examples. Five airports are involved: San Diego Intl (SDI), General Edward Lawrence Logan (GELL), Cleveland-Hopkins Intl (CHI), Philadelphia Intl (PI) and Orlando Intl (OI). Of these, PI, with over 62 airlines, is a hub; the others are local airports. These four pairs, ranked 128, 123, 119 and 114 respectively by LNB-CN, correspond to four non-existent airlines, namely (GELL, SDI), (CHI, SDI), (OI, SDI) and (PI, SDI). Each of them has 21 common neighbors and is thus ranked within the top 82 by CN. Since SDI is located on the west coast while the other four airports are all in the east of the USA, there are no direct airlines connecting SDI with them. Instead, passengers need to transfer at their common neighbors (most of which are hubs). This feature is well captured by LNB-CN. Figure 2 shows the dependence of the common neighbors' $R_w$ on their degree for these four pairs. Clearly, for all four pairs most of the common neighbors play negative roles and thus hamper the connections. Therefore, LNB-CN ranks this kind of pair lower than CN does by assigning it a small score.

Fig. 1: Comparison of the top-100 node pairs ranked by CN and by LNB-CN on the USAir network. (a) The top-100 node pairs ranked by CN and their corresponding ranks assigned by LNB-CN. (b) The top-100 node pairs ranked by LNB-CN and their corresponding ranks assigned by CN. "Hit" denotes an accurately predicted link (i.e., a link in the probe set), while "Error" indicates a non-existent link. The diagonal is shown as a blue solid line.

Fig. 2: The dependence of the common neighbors' $R_w$ on their degree for the four pairs.

Conclusion. - In this Letter, we proposed a local naive Bayes (LNB) model to predict missing links in complex networks. The advantage of this method is that it captures the different roles of common neighbors and assigns them different weights. To test the method, we compared three representative local indices, namely Common Neighbors (CN), the Adamic-Adar (AA) index and the Resource Allocation (RA) index, with their corresponding LNB forms. Extensive analysis of eight real networks drawn from disparate fields shows that, for all three indices, the LNB forms outperform the corresponding original indices. In particular, the improvement is remarkable on the food webs, where the hierarchical structure is obvious and links within the same level are rare. Finally, we gave a detailed analysis of the US air transportation network. Although some pairs of airports have many common neighbors, there are no direct airlines connecting them because of the long geographical distance. LNB methods capture this feature well and thus give more accurate predictions. In addition, some researchers found that the local clustering property can be utilized to improve the accuracy of link prediction [38], yet they did not give a solid justification for their method, whereas the present model provides a theoretical basis for the use of local clustering (see Eq. (14)).